Spaced seeds improve k-mer-based metagenomic classification

Authors

  • Karel Brinda
  • Maciej Sykulski
  • Gregory Kucherov
Abstract

MOTIVATION: Metagenomics is a powerful approach to studying the genetic content of environmental samples, and it has been strongly promoted by next-generation sequencing technologies. To cope with the massive data involved in modern metagenomic projects, recent tools rely on the analysis of k-mers shared between the read to be classified and sampled reference genomes.

RESULTS: Within this general framework, we show that spaced seeds significantly improve classification accuracy compared with traditional contiguous k-mers. We support this thesis through a series of computational experiments, including simulations of large-scale metagenomic projects.

AVAILABILITY AND IMPLEMENTATION, SUPPLEMENTARY INFORMATION: Scripts and programs used in this study, as well as supplementary material, are available from http://github.com/gregorykucherov/spaced-seeds-for-metagenomics.

CONTACT: [email protected].
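The framework described in the abstract, counting k-mers shared between a read and each reference genome, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `spaced_kmers` and `classify`, the seed string, and the toy sequences are all assumptions made for illustration. A seed is written as a string of `1` (position compared) and `0` (position ignored); a seed made only of `1`s gives ordinary contiguous k-mers.

```python
from collections import Counter


def spaced_kmers(seq, seed):
    """Yield the masked k-mers of seq under a seed such as '1101'.

    Only positions marked '1' in the seed are kept (hypothetical helper).
    """
    keep = [i for i, flag in enumerate(seed) if flag == "1"]
    for start in range(len(seq) - len(seed) + 1):
        yield "".join(seq[start + i] for i in keep)


def classify(read, references, seed):
    """Count, per reference, how many masked k-mers of the read it contains."""
    # Index each reference genome by its set of masked k-mers.
    index = {name: set(spaced_kmers(genome, seed))
             for name, genome in references.items()}
    hits = Counter()
    for kmer in spaced_kmers(read, seed):
        for name, kmers in index.items():
            if kmer in kmers:
                hits[name] += 1
    return hits


# Toy data, invented for this example only.
references = {"A": "ACGTTGCATGCA", "B": "TTTTTTTTTTTT"}
hits = classify("ACGTTGCA", references, seed="1101")
best = hits.most_common(1)[0][0]  # reference with the most shared k-mers
```

A real classifier would use a compact index and a taxonomy-aware decision rule; the sketch only shows how the seed changes which positions must match for a k-mer to count as shared.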

Similar resources

Clustering metagenomic reads using spaced k-mers

With the emergence of next-generation sequencing technologies, the classification of short reads in a metagenomic sample has become an important yet difficult task. Several tools attempt to tackle this problem, each with strengths in certain situations. Herein, a novel method is proposed whose strength lies in processing short reads. It is based on two new concepts: utilizing m...

Higher Classification Accuracy of Short Metagenomic Reads by Discriminative Spaced k-mers

The growing number of metagenomic studies in medicine and environmental sciences is creating new computational demands in the analysis of these very large datasets. We have recently proposed a time-efficient algorithm called Clark that can accurately classify metagenomic sequences against a set of reference genomes. The competitive advantage of Clark depends on the use of discriminative contiguo...

A Coverage Criterion for Spaced Seeds and Its Applications to Support Vector Machine String Kernels and k-Mer Distances

Spaced seeds have been recently shown to not only detect more alignments, but also to give a more accurate measure of phylogenetic distances, and to provide a lower misclassification rate when used with Support Vector Machines (SVMs). We confirm by independent experiments these two results, and propose in this article to use a coverage criterion to measure the seed efficiency in both cases in o...

A coverage criterion for spaced seeds and its applications to SVM string-kernels and k-mer distances

Spaced seeds have been recently shown to not only detect more alignments, but also to give a more accurate measure of phylogenetic distances (Boden et al., 2013; Horwege et al., 2014; Leimeister et al., 2014), and to provide a lower misclassification rate when used with Support Vector Machines (SVMs) (Onodera and Shibuya, 2013). We confirm by independent experiments these two results, and propo...

Amino Acid Classification and Hash Seeds for Homology Search

Spaced seeds have been extensively studied in the homology search field. A spaced seed can be regarded as a very special type of hash function on k-mers, where two k-mers have the same hash value if and only if they are identical at the w (w < k) positions designated by the seed. Spaced seeds substantially increased the homology search sensitivity. It is then a natural question to ask whether t...
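The view of a spaced seed as a hash function on k-mers can be made concrete with a short sketch. The function name `seed_hash`, the 2-bit base encoding, and the example seed are assumptions chosen for illustration, not taken from the paper. The hash packs only the w designated positions into an integer, so two k-mers collide exactly when they agree at those positions.

```python
# 2-bit encoding of DNA bases (an illustrative choice).
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def seed_hash(kmer, seed):
    """Hash a k-mer using only the positions marked '1' in the seed.

    The seed has length k and weight w (number of '1's), with w < k
    for a proper spaced seed.
    """
    if len(kmer) != len(seed):
        raise ValueError("k-mer and seed must have the same length k")
    value = 0
    for base, flag in zip(kmer, seed):
        if flag == "1":
            # Append this base as one base-4 digit of the hash value.
            value = value * 4 + CODE[base]
    return value


seed = "1101101"                   # k = 7, w = 5
a = seed_hash("ACGTACG", seed)
b = seed_hash("ACTTAAG", seed)     # differs only at ignored positions
c = seed_hash("ACGTACT", seed)     # differs at a designated position
```

Here `a == b` because the two k-mers differ only where the seed has `0`, while `c` differs from both because its last position is designated by the seed.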

Journal:
  • Bioinformatics

Volume 31, Issue 22

Pages: -

Publication date: 2015